Load libraries:
library(tidyverse)
## ── Attaching packages ────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.0 ✓ purrr 0.3.3
## ✓ tibble 3.0.1 ✓ dplyr 0.8.5
## ✓ tidyr 1.0.2 ✓ stringr 1.4.0
## ✓ readr 1.3.1 ✓ forcats 0.5.0
## Warning: package 'tibble' was built under R version 3.6.2
## ── Conflicts ───────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
##
## Attaching package: 'GGally'
## The following object is masked from 'package:dplyr':
##
## nasa
library(modelr)
library(janitor)
##
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
Load dataset and examine it:
avocados <- clean_names(read_csv("data/avocado.csv"))
## Warning: Missing column names filled in: 'X1' [1]
## Parsed with column specification:
## cols(
## X1 = col_double(),
## Date = col_date(format = ""),
## AveragePrice = col_double(),
## `Total Volume` = col_double(),
## `4046` = col_double(),
## `4225` = col_double(),
## `4770` = col_double(),
## `Total Bags` = col_double(),
## `Small Bags` = col_double(),
## `Large Bags` = col_double(),
## `XLarge Bags` = col_double(),
## type = col_character(),
## year = col_double(),
## region = col_character()
## )
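As an aside, it's clean_names() that turns the raw headers like `Total Volume` and `4046` into the snake_case names we use below (R names can't start with a digit, so those get an "x" prefix). A minimal illustration on a made-up tibble:

```r
library(janitor)
library(tibble)

# clean_names() converts headers to snake_case and prefixes
# digit-leading names (invalid in R) with "x"
raw <- tibble(`Total Volume` = 1, `4046` = 2, AveragePrice = 3)
names(clean_names(raw))
# "total_volume" "x4046" "average_price"
```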
summary(avocados)
## x1 date average_price total_volume
## Min. : 0.00 Min. :2015-01-04 Min. :0.440 Min. : 85
## 1st Qu.:10.00 1st Qu.:2015-10-25 1st Qu.:1.100 1st Qu.: 10839
## Median :24.00 Median :2016-08-14 Median :1.370 Median : 107377
## Mean :24.23 Mean :2016-08-13 Mean :1.406 Mean : 850644
## 3rd Qu.:38.00 3rd Qu.:2017-06-04 3rd Qu.:1.660 3rd Qu.: 432962
## Max. :52.00 Max. :2018-03-25 Max. :3.250 Max. :62505647
## x4046 x4225 x4770 total_bags
## Min. : 0 Min. : 0 Min. : 0 Min. : 0
## 1st Qu.: 854 1st Qu.: 3009 1st Qu.: 0 1st Qu.: 5089
## Median : 8645 Median : 29061 Median : 185 Median : 39744
## Mean : 293008 Mean : 295155 Mean : 22840 Mean : 239639
## 3rd Qu.: 111020 3rd Qu.: 150207 3rd Qu.: 6243 3rd Qu.: 110783
## Max. :22743616 Max. :20470573 Max. :2546439 Max. :19373134
## small_bags large_bags x_large_bags type
## Min. : 0 Min. : 0 Min. : 0.0 Length:18249
## 1st Qu.: 2849 1st Qu.: 127 1st Qu.: 0.0 Class :character
## Median : 26363 Median : 2648 Median : 0.0 Mode :character
## Mean : 182195 Mean : 54338 Mean : 3106.4
## 3rd Qu.: 83338 3rd Qu.: 22029 3rd Qu.: 132.5
## Max. :13384587 Max. :5719097 Max. :551693.7
## year region
## Min. :2015 Length:18249
## 1st Qu.:2015 Class :character
## Median :2016 Mode :character
## Mean :2016
## 3rd Qu.:2017
## Max. :2018
head(avocados)
avocados %>%
distinct(region) %>%
summarise(number_of_regions = n())
avocados %>%
distinct(date) %>%
summarise(
number_of_dates = n(),
min_date = min(date),
max_date = max(date)
)
The x1 variable is just an index column carried over from the original file, while region and date would each generate a large number of dummy variables, so let’s drop all three. We should also treat year as a factor rather than as numerical, and we might as well make sure type is a factor too:
trimmed_avocados <- avocados %>%
select(-c("x1", "date", "region")) %>%
mutate(
year = as_factor(year),
type = as_factor(type)
)
Now let’s check for aliased variables (i.e. combinations of variables in which one or more of the variables can be calculated exactly from other variables):
alias(average_price ~ ., data = trimmed_avocados )
## Model :
## average_price ~ total_volume + x4046 + x4225 + x4770 + total_bags +
## small_bags + large_bags + x_large_bags + type + year
Nice, we don’t find any aliases.
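To see what alias() would flag if an alias did exist, here is a toy example (made-up data) in which one column is an exact linear combination of two others:

```r
# Toy data: z is exactly x + y, so z is aliased in any model
# that already contains x and y
toy <- data.frame(x = 1:10, y = (1:10)^2, outcome = rnorm(10))
toy$z <- toy$x + toy$y

# The $Complete component expresses z in terms of the other predictors
alias(outcome ~ x + y + z, data = toy)$Complete
```

The printed matrix has a row for `z` with coefficients 1 on `x` and 1 on `y`, i.e. it recovers the dependency z = x + y.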
Run ggpairs() on the remaining variables:
ggpairs(trimmed_avocados)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## (message repeated 18 times, once per histogram panel)
Let’s save that plot so we can zoom in on it more easily:
ggsave("pairs_plot_choice1.png", width = 10, height = 10, units = "in")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## (message repeated 18 times, once per histogram panel)
Test competing models with x4046 and type:
model1a <- lm(average_price ~ x4046, data = trimmed_avocados)
par(mfrow = c(2, 2))
plot(model1a)
summary(model1a)
##
## Call:
## lm(formula = average_price ~ x4046, data = trimmed_avocados)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.98539 -0.29842 -0.03531 0.25459 1.82475
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.425e+00 2.993e-03 476.29 <2e-16 ***
## x4046 -6.631e-08 2.305e-09 -28.77 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3939 on 18247 degrees of freedom
## Multiple R-squared: 0.0434, Adjusted R-squared: 0.04334
## F-statistic: 827.8 on 1 and 18247 DF, p-value: < 2.2e-16
model1b <- lm(average_price ~ type, data = trimmed_avocados)
par(mfrow = c(2, 2))
plot(model1b)
summary(model1b)
##
## Call:
## lm(formula = average_price ~ type, data = trimmed_avocados)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.21400 -0.20400 -0.02804 0.18600 1.59600
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.158040 0.003321 348.7 <2e-16 ***
## typeorganic 0.495959 0.004697 105.6 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3173 on 18247 degrees of freedom
## Multiple R-squared: 0.3793, Adjusted R-squared: 0.3792
## F-statistic: 1.115e+04 on 1 and 18247 DF, p-value: < 2.2e-16
Of these, model1b is clearly the stronger (adjusted R-squared of 0.379 versus 0.043), so we’ll keep it and re-run ggpairs() against its residuals.
avocados_remaining_resid <- trimmed_avocados %>%
add_residuals(model1b) %>%
select(-c("average_price", "type"))
ggpairs(avocados_remaining_resid)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## (message repeated 9 times, once per histogram panel)
ggsave("pairs_plot_choice2.png", width = 10, height = 10, units = "in")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## (message repeated 9 times, once per histogram panel)
Looks like x4046 and year are our next strong contenders:
model2a <- lm(average_price ~ type + x4046, data = trimmed_avocados)
par(mfrow = c(2, 2))
plot(model2a)
summary(model2a)
##
## Call:
## lm(formula = average_price ~ type + x4046, data = trimmed_avocados)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.21416 -0.20029 -0.02736 0.18591 1.59589
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.171e+00 3.485e-03 336.13 <2e-16 ***
## typeorganic 4.827e-01 4.802e-03 100.52 <2e-16 ***
## x4046 -2.323e-08 1.898e-09 -12.24 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.316 on 18246 degrees of freedom
## Multiple R-squared: 0.3843, Adjusted R-squared: 0.3843
## F-statistic: 5695 on 2 and 18246 DF, p-value: < 2.2e-16
model2b <- lm(average_price ~ type + year, data = trimmed_avocados)
par(mfrow = c(2, 2))
plot(model2b)
summary(model2b)
##
## Call:
## lm(formula = average_price ~ type + year, data = trimmed_avocados)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.32320 -0.18722 -0.01722 0.18278 1.66337
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.127645 0.004704 239.735 < 2e-16 ***
## typeorganic 0.495980 0.004563 108.685 < 2e-16 ***
## year2016 -0.036995 0.005817 -6.360 2.07e-10 ***
## year2017 0.139580 0.005790 24.107 < 2e-16 ***
## year2018 -0.028104 0.009499 -2.959 0.00309 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3082 on 18244 degrees of freedom
## Multiple R-squared: 0.4142, Adjusted R-squared: 0.4141
## F-statistic: 3225 on 4 and 18244 DF, p-value: < 2.2e-16
So model2b comes out slightly ahead here. Everything looks good so far: all coefficients are significant at the \(0.05\) level and the diagnostics are acceptable.
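Since model1b is nested inside model2b, we could also back up this comparison with a formal F-test via anova(). An illustrative sketch on the built-in mtcars data (for our models the equivalent call would be `anova(model1b, model2b)`):

```r
# Illustrative nested-model comparison on mtcars
small  <- lm(mpg ~ wt, data = mtcars)
bigger <- lm(mpg ~ wt + factor(cyl), data = mtcars)

# A significant F statistic means the extra terms earn their keep
anova(small, bigger)
```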
avocados_remaining_resid <- trimmed_avocados %>%
add_residuals(model2b) %>%
select(-c("average_price", "type", "year"))
ggpairs(avocados_remaining_resid)
ggsave("pairs_plot_choice3.png", width = 10, height = 10, units = "in")
The next contender variable looks to be x4046. Let’s try it out.
model3 <- lm(average_price ~ type + year + x4046, data = trimmed_avocados)
par(mfrow = c(2, 2))
plot(model3)
summary(model3)
##
## Call:
## lm(formula = average_price ~ type + year + x4046, data = trimmed_avocados)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.32327 -0.18629 -0.01406 0.18370 1.66375
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.141e+00 4.808e-03 237.366 < 2e-16 ***
## typeorganic 4.827e-01 4.664e-03 103.492 < 2e-16 ***
## year2016 -3.776e-02 5.792e-03 -6.518 7.31e-11 ***
## year2017 1.392e-01 5.765e-03 24.147 < 2e-16 ***
## year2018 -2.692e-02 9.459e-03 -2.846 0.00443 **
## x4046 -2.319e-08 1.844e-09 -12.575 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3069 on 18243 degrees of freedom
## Multiple R-squared: 0.4192, Adjusted R-squared: 0.4191
## F-statistic: 2634 on 5 and 18243 DF, p-value: < 2.2e-16
Everything still looks OK, though there is perhaps some mild heteroscedasticity. All coefficients remain significant.
avocados_remaining_resid <- trimmed_avocados %>%
add_residuals(model3) %>%
select(-c("average_price", "type", "year", "x4046"))
ggpairs(avocados_remaining_resid)
We’re definitely now chasing variables with rather poor predictive power, so let’s try one more: x4225 looks to be our best choice.
model4 <- lm(average_price ~ type + year + x4046 + x4225, data = trimmed_avocados)
par(mfrow = c(2, 2))
plot(model4)
summary(model4)
##
## Call:
## lm(formula = average_price ~ type + year + x4046 + x4225, data = trimmed_avocados)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.32472 -0.18429 -0.01764 0.17756 1.66558
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.135e+00 4.785e-03 237.161 < 2e-16 ***
## typeorganic 4.878e-01 4.636e-03 105.199 < 2e-16 ***
## year2016 -3.887e-02 5.747e-03 -6.764 1.38e-11 ***
## year2017 1.418e-01 5.721e-03 24.779 < 2e-16 ***
## year2018 -2.319e-02 9.386e-03 -2.471 0.0135 *
## x4046 -9.814e-08 4.730e-09 -20.751 < 2e-16 ***
## x4225 8.552e-08 4.976e-09 17.186 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3045 on 18242 degrees of freedom
## Multiple R-squared: 0.4285, Adjusted R-squared: 0.4283
## F-statistic: 2279 on 6 and 18242 DF, p-value: < 2.2e-16
All coefficients remain significant, and diagnostics look fine, but we are achieving only marginal gains in predictive power of the model.
Let’s now think about possible pairwise interactions: with four main effect variables we have six possible pairs. Let’s test them out.
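Rather than typing each candidate out by hand, the six pairs could also be enumerated programmatically; a sketch using the variable names from trimmed_avocados:

```r
vars <- c("type", "year", "x4046", "x4225")

# All choose(4, 2) = 6 pairwise interaction terms
pairs <- combn(vars, 2, FUN = paste, collapse = ":")
pairs
# "type:year" "type:x4046" "type:x4225" "year:x4046" "year:x4225" "x4046:x4225"

# One formula per candidate: the four main effects plus one interaction
formulas <- lapply(pairs, function(p) reformulate(c(vars, p), response = "average_price"))
```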
model4pa <- lm(average_price ~ type + year + x4046 + x4225 + type:year, data = trimmed_avocados)
par(mfrow = c(2, 2))
plot(model4pa)
summary(model4pa)
##
## Call:
## lm(formula = average_price ~ type + year + x4046 + x4225 + type:year,
## data = trimmed_avocados)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.29520 -0.18338 -0.01407 0.17136 1.67835
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.084e+00 5.829e-03 186.000 < 2e-16 ***
## typeorganic 5.889e-01 8.156e-03 72.202 < 2e-16 ***
## year2016 2.432e-02 8.076e-03 3.011 0.0026 **
## year2017 2.218e-01 8.042e-03 27.583 < 2e-16 ***
## year2018 6.052e-02 1.319e-02 4.587 4.53e-06 ***
## x4046 -9.978e-08 4.705e-09 -21.208 < 2e-16 ***
## x4225 8.747e-08 4.950e-09 17.672 < 2e-16 ***
## typeorganic:year2016 -1.264e-01 1.142e-02 -11.072 < 2e-16 ***
## typeorganic:year2017 -1.601e-01 1.137e-02 -14.075 < 2e-16 ***
## typeorganic:year2018 -1.673e-01 1.865e-02 -8.968 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3025 on 18239 degrees of freedom
## Multiple R-squared: 0.4358, Adjusted R-squared: 0.4355
## F-statistic: 1565 on 9 and 18239 DF, p-value: < 2.2e-16
model4pb <- lm(average_price ~ type + year + x4046 + x4225 + type:x4046, data = trimmed_avocados)
par(mfrow = c(2, 2))
plot(model4pb)
summary(model4pb)
##
## Call:
## lm(formula = average_price ~ type + year + x4046 + x4225 + type:x4046,
## data = trimmed_avocados)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.33603 -0.18486 -0.01597 0.17871 1.65780
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.136e+00 4.766e-03 238.375 < 2e-16 ***
## typeorganic 5.004e-01 4.726e-03 105.887 < 2e-16 ***
## year2016 -4.101e-02 5.725e-03 -7.164 8.14e-13 ***
## year2017 1.396e-01 5.700e-03 24.493 < 2e-16 ***
## year2018 -2.499e-02 9.347e-03 -2.673 0.00751 **
## x4046 -9.945e-08 4.711e-09 -21.111 < 2e-16 ***
## x4225 8.736e-08 4.957e-09 17.621 < 2e-16 ***
## typeorganic:x4046 -1.696e-06 1.352e-07 -12.546 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3032 on 18241 degrees of freedom
## Multiple R-squared: 0.4334, Adjusted R-squared: 0.4331
## F-statistic: 1993 on 7 and 18241 DF, p-value: < 2.2e-16
model4pc <- lm(average_price ~ type + year + x4046 + x4225 + type:x4225, data = trimmed_avocados)
par(mfrow = c(2, 2))
plot(model4pc)
summary(model4pc)
##
## Call:
## lm(formula = average_price ~ type + year + x4046 + x4225 + type:x4225,
## data = trimmed_avocados)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.33014 -0.18384 -0.01568 0.17885 1.66321
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.134e+00 4.779e-03 237.320 < 2e-16 ***
## typeorganic 4.957e-01 4.759e-03 104.171 < 2e-16 ***
## year2016 -3.821e-02 5.739e-03 -6.657 2.87e-11 ***
## year2017 1.423e-01 5.714e-03 24.898 < 2e-16 ***
## year2018 -2.198e-02 9.374e-03 -2.344 0.0191 *
## x4046 -9.937e-08 4.726e-09 -21.025 < 2e-16 ***
## x4225 8.709e-08 4.974e-09 17.509 < 2e-16 ***
## typeorganic:x4225 -5.047e-07 6.970e-08 -7.241 4.63e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.304 on 18241 degrees of freedom
## Multiple R-squared: 0.4301, Adjusted R-squared: 0.4299
## F-statistic: 1967 on 7 and 18241 DF, p-value: < 2.2e-16
model4pd <- lm(average_price ~ type + year + x4046 + x4225 + year:x4046, data = trimmed_avocados)
par(mfrow = c(2, 2))
plot(model4pd)
summary(model4pd)
##
## Call:
## lm(formula = average_price ~ type + year + x4046 + x4225 + year:x4046,
## data = trimmed_avocados)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.32301 -0.18379 -0.01753 0.17736 1.66404
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.137e+00 4.841e-03 234.779 < 2e-16 ***
## typeorganic 4.878e-01 4.634e-03 105.248 < 2e-16 ***
## year2016 -3.906e-02 5.896e-03 -6.625 3.56e-11 ***
## year2017 1.383e-01 5.869e-03 23.571 < 2e-16 ***
## year2018 -3.158e-02 9.631e-03 -3.279 0.00104 **
## x4046 -1.074e-07 5.481e-09 -19.593 < 2e-16 ***
## x4225 8.907e-08 5.049e-09 17.640 < 2e-16 ***
## year2016:x4046 -2.122e-10 4.661e-09 -0.046 0.96368
## year2017:x4046 1.194e-08 4.495e-09 2.656 0.00792 **
## year2018:x4046 2.490e-08 6.395e-09 3.894 9.89e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3043 on 18239 degrees of freedom
## Multiple R-squared: 0.4291, Adjusted R-squared: 0.4289
## F-statistic: 1523 on 9 and 18239 DF, p-value: < 2.2e-16
model4pe <- lm(average_price ~ type + year + x4046 + x4225 + year:x4225, data = trimmed_avocados)
par(mfrow = c(2, 2))
plot(model4pe)
summary(model4pe)
##
## Call:
## lm(formula = average_price ~ type + year + x4046 + x4225 + year:x4225,
## data = trimmed_avocados)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.32138 -0.18481 -0.01786 0.17825 1.66418
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.138e+00 4.846e-03 234.874 < 2e-16 ***
## typeorganic 4.879e-01 4.632e-03 105.312 < 2e-16 ***
## year2016 -4.088e-02 5.911e-03 -6.917 4.78e-12 ***
## year2017 1.349e-01 5.884e-03 22.928 < 2e-16 ***
## year2018 -3.332e-02 9.653e-03 -3.452 0.000557 ***
## x4046 -1.030e-07 4.810e-09 -21.413 < 2e-16 ***
## x4225 7.946e-08 5.553e-09 14.311 < 2e-16 ***
## year2016:x4225 5.904e-09 4.617e-09 1.279 0.201013
## year2017:x4225 2.412e-08 4.888e-09 4.936 8.06e-07 ***
## year2018:x4225 3.299e-08 7.359e-09 4.483 7.39e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3042 on 18239 degrees of freedom
## Multiple R-squared: 0.4296, Adjusted R-squared: 0.4294
## F-statistic: 1527 on 9 and 18239 DF, p-value: < 2.2e-16
model4pf <- lm(average_price ~ type + year + x4046 + x4225 + x4046:x4225, data = trimmed_avocados)
par(mfrow = c(2, 2))
plot(model4pf)
summary(model4pf)
##
## Call:
## lm(formula = average_price ~ type + year + x4046 + x4225 + x4046:x4225,
## data = trimmed_avocados)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.32496 -0.18276 -0.01693 0.17719 1.66537
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.150e+00 5.066e-03 227.098 < 2e-16 ***
## typeorganic 4.729e-01 4.899e-03 96.524 < 2e-16 ***
## year2016 -3.906e-02 5.734e-03 -6.812 9.92e-12 ***
## year2017 1.414e-01 5.708e-03 24.772 < 2e-16 ***
## year2018 -2.326e-02 9.364e-03 -2.484 0.013 *
## x4046 -1.191e-07 5.241e-09 -22.733 < 2e-16 ***
## x4225 5.857e-08 5.764e-09 10.161 < 2e-16 ***
## x4046:x4225 4.084e-15 4.435e-16 9.209 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3038 on 18241 degrees of freedom
## Multiple R-squared: 0.4311, Adjusted R-squared: 0.4309
## F-statistic: 1975 on 7 and 18241 DF, p-value: < 2.2e-16
So it looks like model4pa with the type:year interaction is the strongest. But, again, the improvement due to the interaction is rather marginal.
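Instead of eyeballing six summaries, we could tabulate fit statistics across candidate models and sort. An illustrative sketch fitted on mtcars for self-containedness; with our data the list would hold model4pa through model4pf:

```r
library(purrr)
library(dplyr)
library(tibble)

# Illustrative: rank candidate models by BIC (lower is better)
candidates <- list(
  main_only   = lm(mpg ~ wt + hp, data = mtcars),
  interaction = lm(mpg ~ wt * hp, data = mtcars)
)

imap_dfr(candidates, ~ tibble(
  name   = .y,
  bic    = BIC(.x),
  adj_r2 = summary(.x)$adj.r.squared
)) %>%
  arrange(bic)
```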
Let’s try to fit a predictive model using glmulti():
library(glmulti)
## Loading required package: rJava
This dataset is pretty big for glmulti() on a single CPU core, so we likely won’t be able to search over main effects and pairwise interactions simultaneously. Let’s look first for the best main-effects-only model, using BIC as our metric:
set.seed(99) # arbitrary seed, added so the train/test split is reproducible
n_data <- nrow(trimmed_avocados)
test_index <- sample(1:n_data, size = n_data * 0.3)
test <- slice(trimmed_avocados, test_index)
train <- slice(trimmed_avocados, -test_index)
# sanity check
nrow(test) + nrow(train) == n_data
## [1] TRUE
nrow(test)
## [1] 5474
nrow(train)
## [1] 12775
glmulti_fit <- glmulti(
  average_price ~ .,
  data = train,
  level = 1, # 1 = main effects only, 2 = also include pairwise interactions
  minsize = 1, # smallest model considered: one predictor
  maxsize = 10, # largest model considered: ten terms (-1 would mean no limit)
  marginality = TRUE, # 'strongly hierarchical' interactions: include a pairwise interaction only if both predictors are also present as main effects (only relevant at level = 2)
  method = "h", # "h" = exhaustive screening of the candidate set; "g" would use a genetic algorithm for problems too large to search exhaustively
  crit = bic, # criterion for model selection is BIC value (lower is better)
  plotty = FALSE, # don't plot models as function runs
  report = TRUE, # do produce progress reports as function runs
  confsetsize = 10, # return best 10 models
  fitfunction = lm # fit using the `lm` function
)
## Initialization...
## TASK: Exhaustive screening of candidate set.
## Fitting...
##
## After 50 models:
## Best model: average_price~1+total_volume+x4225+x4770+small_bags
## Crit= 12363.916422852
## Mean crit= 12373.6443154778
##
## After 100 models:
## Best model: average_price~1+total_volume+x4046+x4770+large_bags
## Crit= 12361.0264058837
## Mean crit= 12368.9393879394
##
## After 150 models:
## Best model: average_price~1+x4046+x4225+x4770+x_large_bags
## Crit= 12360.7811846587
## Mean crit= 12366.9668771096
##
## After 200 models:
## Best model: average_price~1+total_volume+x4225+x4770+small_bags+x_large_bags
## Crit= 12360.1957060061
## Mean crit= 12365.3102523046
##
## After 250 models:
## Best model: average_price~1+total_volume+x4225+x4770+small_bags+x_large_bags
## Crit= 12360.1957060061
## Mean crit= 12364.4539336146
##
## After 300 models:
## Best model: average_price~1+type+x4046+x4225+x4770
## Crit= 6551.06679847876
## Mean crit= 6563.55239189649
##
## After 350 models:
## Best model: average_price~1+type+total_volume+x4046+x4770+large_bags
## Crit= 6517.80876604273
## Mean crit= 6532.5589354949
##
## After 400 models:
## Best model: average_price~1+type+total_volume+x4046+x4225+total_bags+small_bags+large_bags
## Crit= 6503.37589291178
## Mean crit= 6512.40367059575
##
## After 450 models:
## Best model: average_price~1+type+total_volume+x4225+x4770+small_bags+x_large_bags
## Crit= 6494.34331942331
## Mean crit= 6503.87289800822
##
## After 500 models:
## Best model: average_price~1+type+total_volume+x4225+x4770+small_bags+x_large_bags
## Crit= 6494.34331942331
## Mean crit= 6501.85092412251
##
## After 550 models:
## Best model: average_price~1+type+total_volume+x4225+x4770+small_bags+x_large_bags
## Crit= 6494.34331942331
## Mean crit= 6501.84998729346
##
## After 600 models:
## Best model: average_price~1+type+total_volume+x4225+x4770+small_bags+x_large_bags
## Crit= 6494.34331942331
## Mean crit= 6501.84998729346
##
## After 650 models:
## Best model: average_price~1+type+total_volume+x4225+x4770+small_bags+x_large_bags
## Crit= 6494.34331942331
## Mean crit= 6501.84998729346
##
## After 700 models:
## Best model: average_price~1+type+total_volume+x4225+x4770+small_bags+x_large_bags
## Crit= 6494.34331942331
## Mean crit= 6501.84998729346
##
## After 750 models:
## Best model: average_price~1+type+total_volume+x4225+x4770+small_bags+x_large_bags
## Crit= 6494.34331942331
## Mean crit= 6501.84998729346
##
## After 800 models:
## Best model: average_price~1+type+year+x4046+x4225
## Crit= 5838.34205674501
## Mean crit= 6159.78334820428
##
## After 850 models:
## Best model: average_price~1+type+year+total_volume+x4225+small_bags
## Crit= 5799.32125718692
## Mean crit= 5814.71382907454
##
## After 900 models:
## Best model: average_price~1+type+year+total_volume+x4046+x4770+large_bags
## Crit= 5795.30451717002
## Mean crit= 5800.4347486542
##
## After 950 models:
## Best model: average_price~1+type+year+total_volume+x4046+x4225+total_bags+small_bags+large_bags
## Crit= 5784.23142738162
## Mean crit= 5788.47933983094
##
## After 1000 models:
## Best model: average_price~1+type+year+total_volume+x4225+x4770+small_bags+x_large_bags
## Crit= 5775.80692307433
## Mean crit= 5783.40200843071
##
## After 1050 models:
## Best model: average_price~1+type+year+total_volume+x4225+x4770+small_bags+x_large_bags
## Crit= 5775.80692307433
## Mean crit= 5782.64357011512
## Completed.
summary(glmulti_fit)
## $name
## [1] "glmulti.analysis"
##
## $method
## [1] "h"
##
## $fitting
## [1] "lm"
##
## $crit
## [1] "bic"
##
## $level
## [1] 1
##
## $marginality
## [1] TRUE
##
## $confsetsize
## [1] 10
##
## $bestic
## [1] 5775.807
##
## $icvalues
## [1] 5775.807 5776.721 5784.231 5784.231 5784.232 5784.232 5784.244 5784.246
## [9] 5784.246 5784.246
##
## $bestmodel
## [1] "average_price ~ 1 + type + year + total_volume + x4225 + x4770 + "
## [2] " small_bags + x_large_bags"
##
## $modelweights
## [1] 0.571056066 0.361507796 0.008459036 0.008458737 0.008458706 0.008458653
## [7] 0.008404445 0.008399060 0.008398763 0.008398738
##
## $includeobjects
## [1] TRUE
So the lowest-BIC model with main effects is average_price ~ type + year + total_volume + x4225 + x4770 + small_bags + x_large_bags. Let’s have a look at possible extensions to this. We’re going to deliberately push on to the point where models start to overfit (as measured by the RMSE on the test set), so we can see what that looks like.
results <- tibble(
  name = character(), bic = numeric(), rmse_train = numeric(), rmse_test = numeric()
)
# lowest BIC model with main effects
lowest_bic_model <- lm(average_price ~ type + year + total_volume + x4225 + x4770 + small_bags + x_large_bags, data = train)
results <- results %>%
add_row(
tibble_row(
name = "lowest bic",
bic = bic(lowest_bic_model),
rmse_train = rmse(lowest_bic_model, train),
rmse_test = rmse(lowest_bic_model, test)
)
)
# try adding in all possible pairs with these main effects
lowest_bic_model_all_pairs <- lm(average_price ~ (type + year + total_volume + x4225 + x4770 + small_bags + x_large_bags)^2, data = train)
results <- results %>%
add_row(
tibble_row(
name = "lowest bic all pairs",
bic = bic(lowest_bic_model_all_pairs),
rmse_train = rmse(lowest_bic_model_all_pairs, train),
rmse_test = rmse(lowest_bic_model_all_pairs, test)
)
)
# try a model with all main effects
model_all_mains <- lm(average_price ~ ., data = train)
results <- results %>%
add_row(
tibble_row(
name = "all mains",
bic = bic(model_all_mains),
rmse_train = rmse(model_all_mains, train),
rmse_test = rmse(model_all_mains, test)
)
)
# try a model with all main effects and all pairs
model_all_pairs <- lm(average_price ~ .^2, data = train)
results <- results %>%
add_row(
tibble_row(
name = "all pairs",
bic = bic(model_all_pairs),
rmse_train = rmse(model_all_pairs, train),
rmse_test = rmse(model_all_pairs, test)
)
)
## Warning in predict.lm(model, data): prediction from a rank-deficient fit may be
## misleading
## Warning in predict.lm(model, data): prediction from a rank-deficient fit may be
## misleading
# try a model with all main effects, all pairs and all triples (this is getting silly)
model_all_triples <- lm(average_price ~ .^3, data = train)
results <- results %>%
add_row(
tibble_row(
name = "all triples",
bic = bic(model_all_triples),
rmse_train = rmse(model_all_triples, train),
rmse_test = rmse(model_all_triples, test)
)
)
## Warning in predict.lm(model, data): prediction from a rank-deficient fit may be
## misleading
## Warning in predict.lm(model, data): prediction from a rank-deficient fit may be
## misleading
# try a model with all main effects, all pairs, all triples and all quadruples (definitely silly)
model_all_quadruples <- lm(average_price ~ .^4, data = train)
results <- results %>%
add_row(
tibble_row(
name = "all quads",
bic = bic(model_all_quadruples),
rmse_train = rmse(model_all_quadruples, train),
rmse_test = rmse(model_all_quadruples, test)
)
)
## Warning in predict.lm(model, data): prediction from a rank-deficient fit may be
## misleading
## Warning in predict.lm(model, data): prediction from a rank-deficient fit may be
## misleading
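The repeated add_row blocks above are crying out for a small helper; a sketch (score_model is a made-up name) that returns one row of fit statistics per model:

```r
library(tibble)
library(modelr)

# Hypothetical helper: one row of fit statistics per model
score_model <- function(model, name, train, test) {
  tibble(
    name       = name,
    bic        = BIC(model),
    rmse_train = rmse(model, train),
    rmse_test  = rmse(model, test)
  )
}

# Illustrative usage on mtcars; with our data this would be e.g.
# score_model(lowest_bic_model, "lowest bic", train, test)
m <- lm(mpg ~ wt, data = mtcars)
score_model(m, "wt only", mtcars, mtcars)
```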
results <- results %>%
pivot_longer(cols = bic:rmse_test, names_to = "measure", values_to = "value") %>%
mutate(
name = fct_relevel(
as_factor(name),
"lowest bic", "lowest bic all pairs", "all mains", "all pairs", "all triples", "all quads"
)
)
results %>%
filter(measure == "bic") %>%
ggplot(aes(x = name, y = value)) +
geom_col(fill = "steelblue", alpha = 0.7) +
labs(
x = "model",
y = "bic"
)
BIC is telling us here that if we took our lowest-BIC main effects model and added in all possible pairwise interactions, we would likely still improve the model’s predictive performance.
Let’s compare the RMSE values of the various models for train and test sets. We expect train RMSE always to go down as model complexity increases, but what happens to the test RMSE as models get more complex?
results %>%
filter(measure != "bic") %>%
ggplot(aes(x = name, y = value, fill = measure)) +
geom_col(position = "dodge", alpha = 0.7) +
labs(
x = "model",
y = "rmse"
)
Lowest RMSE in test is obtained for the ‘all pairs’ model, and it increases thereafter as the models get more complex. This is a clear indication of overfitting.
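A single train/test split gives a fairly noisy estimate of test RMSE; k-fold cross-validation would be more robust. A sketch using modelr::crossv_kfold, shown on mtcars for brevity (for our data, swap in trimmed_avocados and the chosen formula):

```r
library(modelr)
library(purrr)

set.seed(42) # arbitrary seed for reproducible folds
folds <- crossv_kfold(mtcars, k = 5)

# Fit on each training fold, score RMSE on the matching held-out fold
cv_rmse <- map2_dbl(
  folds$train, folds$test,
  ~ rmse(lm(mpg ~ wt + hp, data = .x), .y)
)
mean(cv_rmse)
```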